Abstract

We propose a label propagation approach to geolocation prediction based
on Modified Adsorption, with two enhancements: (1) the removal of
“celebrity” nodes to increase location homophily and boost tractability;
and (2) the incorporation of text-based geolocation priors for test
users. Experiments over three Twitter benchmark datasets achieve
state-of-the-art results, and demonstrate the effectiveness of the
enhancements.

1 Introduction 


Geolocation of social media users is essential in applications ranging
from rapid disaster response (Earle et al., 2010; Ashktorab et al., 2014;
Morstatter et al., 2013a) and opinion analysis (Mostafa, 2013; Kirilenko
and Stepchenkova, 2014), to recommender systems (Noulas et al., 2012;
Schedl and Schnitzer, 2014). Social media platforms like Twitter provide
support for users to declare their location manually in their text
profile or automatically with GPS-based geotagging. However, the
text-based profile locations are noisy and only 1-3% of tweets are
geotagged (Cheng et al., 2010; Morstatter et al., 2013b), meaning that
geolocation needs to be inferred from other information sources such as
the tweet text and network relationships.

User geolocation is the task of inferring the primary (or “home”)
location of a user from available sources of information, such as text
posted by that individual, or network relationships with other
individuals (Han et al., 2014). Geolocation models are usually trained on
the small set of users whose location is known (e.g. through GPS-based
geotagging), and other users are geolocated using the resulting model.
These models broadly fall into two categories: text-based and
network-based methods. Orthogonally, the geolocation task can be viewed
as a regression task over real-valued geographical coordinates, or a
classification task over discretised region-based locations.


Most previous research on user geolocation has focused either on
text-based classification approaches (Eisenstein et al., 2010; Wing and
Baldridge, 2011; Roller et al., 2012; Han et al., 2014) or, to a lesser
extent, network-based regression approaches (Jurgens, 2013; Compton et
al., 2014; Rahimi et al., 2015). Methods which combine the two, however,
are rare.

In this paper, we present our work on Twitter user geolocation using both
text and network information. Our contributions are as follows: (1) we
propose the use of Modified Adsorption (Talukdar and Crammer, 2009) as a
baseline network-based geolocation model, and show that it outperforms
previous network-based approaches (Jurgens, 2013; Rahimi et al., 2015);
(2) we demonstrate that removing “celebrity” nodes (nodes with high
in-degrees) from the network increases geolocation accuracy and
dramatically decreases network edge size; and (3) we integrate text-based
geolocation priors into Modified Adsorption, and show that our unified
geolocation model outperforms both text-only and network-only approaches,
and achieves state-of-the-art results over three standard datasets.


2 Related Work 

A recent spike in interest in user geolocation over social media data has
resulted in the development of a range of approaches to automatic
geolocation prediction, based on information sources such as the text of
messages, social networks, user profile data, and temporal data.
Text-based methods model the geographical bias of language use in social
media, and use it to geolocate non-geotagged users. Gazetted expressions
(Leidner and Lieberman, 2011) and geographical names (Quercini et al.,
2010) were used as features in early work, but were shown to be sparse in
coverage. Han et al. (2014) used information-theoretic methods to
automatically extract location-indicative words for location
classification. Wing and Baldridge (2014) reported that discriminative
approaches (based on hierarchical classification over adaptive grids),
when optimised properly, are superior to explicit feature selection. Cha
et al. (2015) showed that sparse coding can be used to effectively learn
a latent representation of tweet text to use in user geolocation.
Eisenstein et al. (2010) and Ahmed et al. (2013) proposed topic
model-based approaches to geolocation, based on the assumption that words
are generated from hidden topics and geographical regions. Similarly,
Yuan et al. (2013) used graphical models to jointly learn spatio-temporal
topics for users. The advantage of these generative approaches is that
they are able to work with the continuous geographical space directly
without any pre-discretisation, but they are algorithmically complex and
don’t scale well to larger datasets. Hulden et al. (2015) used
kernel-based methods to smooth linguistic features over very small grid
sizes to alleviate data sparseness.

Network-based geolocation models, on the other hand, utilise the fact
that social media users interact more with people who live nearby.
Jurgens (2013) and Compton et al. (2014) used a Twitter reciprocal
mention network, and geolocated users based on the geographical
coordinates of their friends, by minimising the weighted distance of a
given user to their friends. For a reciprocal mention network to be
effective, however, a huge amount of Twitter data is required. Rahimi et
al. (2015) showed that this assumption could be relaxed to use an
undirected mention network for smaller datasets, and still attain
state-of-the-art results. The greatest shortcoming of network-based
models is that they completely fail to geolocate users who are not
connected to geolocated components of the graph. As shown by Rahimi et
al. (2015), geolocation predictions from text can be used as a backoff
for disconnected users, but there has been little work that has
investigated a more integrated text- and network-based approach to user
geolocation.


3 Data

We evaluate our models over three pre-existing geotagged Twitter
datasets: (1) GeoText (Eisenstein et al., 2010), (2) Twitter-US (Roller
et al., 2012), and (3) Twitter-World (Han et al., 2012). In each dataset,
users are represented by a single meta-document, generated by
concatenating their tweets. The datasets are pre-partitioned into
training, development and test sets, and rebuilt from the original
version to include mention information. The first two datasets were
constructed to contain mostly English messages.

GeoText consists of tweets from 9.5K users: 1895 users are held out for
each of development and test data. The primary location of each user is
set to the coordinates of their first tweet.

Twitter-US consists of 449K users, of which 10K users are held out for
each of development and test data. The primary location of each user is,
once again, set to the coordinates of their first tweet.

Twitter-World consists of 1.3M users, of which 10K each are held out for
development and test. Unlike the other two datasets, the primary location
of users is mapped to the geographic centre of the city where the
majority of their tweets were posted.

4 Methods

We use label propagation over an @-mention graph in our models. We use
k-d tree discretised adaptive grids as class labels for users and learn a
label distribution for each user by label propagation over the @-mention
network using labelled nodes as seeds. For k-d tree discretisation, we
set the number of users in each region to 50, 2400, 2400 for GeoText,
Twitter-US and Twitter-World respectively, based on tuning over the
development data.

Social Network: We used the @-mention information to build an undirected
graph between users. In order to make the inference more tractable, we
removed all nodes that were not a member of the training/test set, and
connected all pairings of training/test users if there was any path
between them (including paths through non training/test users). We call
this network a “collapsed network”, as illustrated in Figure 1. Note that
a celebrity node with n mentions connects n(n - 1) nodes in the collapsed
network. We experiment with both binary and weighted edge (based on the
number of mentions connecting the given users) networks.

(Figure 1 panels: @-mention Network; Collapsed Network plus Dongle Nodes)

Figure 1: A collapsed network is built from the @-mention network. Each
mention is shown by a directed arrow, noting that as it is based
exclusively on the tweets from the training and test users, it will
always be directed from a training or test user to a mentioned node. All
mentioned nodes which are not a member of either training or test users
are removed and the corresponding training and test users, previously
connected through that node, are connected directly by an edge, as
indicated by the dashed lines. Mentioned nodes with more than T unique
mentions (celebrities, such as $m_3$) are removed from the graph. To each
test node, a dongle node that carries the label from another learner
(here, text-based LR) is added in MADCEL-B-LR and MADCEL-W-LR.

Baseline: Our baseline geolocation model (“MAD-B”) is formulated as label
propagation over a binary collapsed network, based on Modified Adsorption
(Talukdar and Crammer, 2009). It applies to a graph $G = (V, E, W)$ where
$V$ is the set of nodes with $|V| = n = n_l + n_u$ (where $n_l$ nodes are
labelled and $n_u$ nodes are unlabelled), $E$ is the set of edges, and
$W$ is an edge weight matrix. Assume $C$ is the set of labels where
$|C| = m$ is the total number of labels. $Y$ is an $n \times m$ matrix
storing the training node labels, and $\hat{Y}$ is the estimated label
distribution for the nodes. The goal is to estimate $\hat{Y}$ for all
nodes (including training nodes) so that the following objective function
is minimised:

$$
C(\hat{Y}) = \sum_{l} \left[ \mu_1 (Y_l - \hat{Y}_l)^{T} S (Y_l - \hat{Y}_l) + \mu_2 \hat{Y}_l^{T} L \hat{Y}_l \right]
$$


where $\mu_1$ and $\mu_2$ are hyperparameters;[1] $L$ is the Laplacian of
an undirected graph derived from $G$; and $S$ is a diagonal binary matrix
indicating if a node is labelled or not. The first term of the equation
forces the labelled nodes to keep their label (prior term), while the
second term pulls a node’s label toward that of its neighbours
(smoothness term). For the first term, the label confidence for training
and test users is set to 1.0 and 0.0, respectively. Based on the
development data, we set $\mu_1$ and $\mu_2$ to 1.0 and 0.1,
respectively, for all the experiments. For Twitter-US and Twitter-World,
the inference was intractable for the default network, as it was too
large.
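
To make the two terms of the objective concrete, the following is a minimal sketch rather than the implementation used in the experiments. It assumes the simplified quadratic form given in the text (no regularisation term and none of the random-walk probabilities of full Modified Adsorption), for which setting the gradient to zero gives the per-node update $\hat{Y}_{il} = (\mu_1 S_{ii} Y_{il} + \mu_2 \sum_j W_{ij} \hat{Y}_{jl}) / (\mu_1 S_{ii} + \mu_2 \sum_j W_{ij})$. The function name `propagate` and the dictionary-based graph format are hypothetical choices made for illustration.

```python
from collections import defaultdict


def propagate(edges, seeds, labels, mu1=1.0, mu2=0.1, iters=200):
    """Jacobi-style iterations for the simplified objective (an illustrative sketch).

    edges:  {(u, v): weight} for an undirected graph (assumed input format)
    seeds:  {node: label} for labelled (training) nodes
    labels: list of all class labels
    Returns {node: {label: score}}.
    """
    # Build a symmetric adjacency map from the undirected edge list.
    nbrs = defaultdict(dict)
    for (u, v), w in edges.items():
        nbrs[u][v] = w
        nbrs[v][u] = w
    nodes = set(nbrs) | set(seeds)

    y_hat = {n: {c: 0.0 for c in labels} for n in nodes}
    for _ in range(iters):
        new = {}
        for n in nodes:
            s = 1.0 if n in seeds else 0.0      # S_ii: 1 for labelled nodes
            degree = sum(nbrs[n].values())      # sum_j W_ij
            denom = mu1 * s + mu2 * degree
            row = {}
            for c in labels:
                # Prior term: pulls a seed node toward its own label.
                prior = mu1 * s * (1.0 if seeds.get(n) == c else 0.0)
                # Smoothness term: weighted average pull from neighbours.
                smooth = mu2 * sum(w * y_hat[m][c] for m, w in nbrs[n].items())
                row[c] = (prior + smooth) / denom if denom > 0 else 0.0
            new[n] = row
        y_hat = new
    return y_hat
```

On a four-node chain a-b-c-d with a labelled "north" and d labelled "south", node b ends up favouring "north" and node c favouring "south". A test node in a component with no labelled node keeps all-zero scores, which corresponds to the Unknown case for disconnected users.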


[1] In the base formulation of MAD-B, there is also a regularisation term
with weight $\mu_3$, but in all our experiments, we found that the best
results were achieved over development data with $\mu_3 = 0$, i.e. with
no regularisation; the term is thus omitted from our description.

There are two immediate issues with the baseline graph propagation
method: (1) it doesn’t scale to large datasets with high edge counts,
related to which, it tends to be biased by highly-connected nodes; and
(2) it can’t predict the geolocation of test users who aren’t connected
to any training user (MAD-B returns Unknown, which we rewrite with the
centre of the map). We redress these two issues as follows.

Celebrity Removal: To address the first issue, we target “celebrity”
users, i.e. highly-mentioned Twitter users. Edges involving these users
often carry little or no geolocation information (e.g. the majority of
people who mention Barack Obama don’t live in Washington D.C.).
Additionally, these users tend to be highly connected to other users and
generate a disproportionately high number of edges in the graph, leading
in large part to the baseline MAD-B not scaling over large datasets such
as Twitter-US and Twitter-World. We identify and filter out celebrity
nodes simply by assuming that a celebrity is mentioned by more than T
users, where T is tuned over development data. Based on tuning over the
development set of GeoText and Twitter-US, T was set to 5 and 15
respectively. For Twitter-World tuning was very resource intensive so T
was set to 5 based on GeoText, to make the inference faster. Celebrity
removal dramatically reduced the edge count in all three datasets (from
$1 \times 10^{9}$ to $5 \times 10^{6}$ for Twitter-US and from
$4 \times 10^{10}$ to $1 \times 10^{7}$ for Twitter-World), and made
inference tractable for Twitter-US and Twitter-World. Jurgens et al.
(2015) report that the time complexity of most network-based geolocation
methods is $O(k^2)$ for each node where $k$ is the average number of
vertex neighbours. In the case of the collapsed network of Twitter-World,
$k$ is decreased by a factor of 4000 after setting the celebrity
threshold T to 5. We apply celebrity removal over both binary
(“MADCEL-B”) and weighted (“MADCEL-W”) networks (using the respective T
for each dataset). The effect of celebrity removal over the development
set of Twitter-US is shown in Figure 2, where it dramatically reduces the
graph edge size and simultaneously leads to an improvement in the mean
error.

Figure 2: Effect of celebrity removal on geolocation performance and
graph size. For each T performance is measured over the development set
of Twitter-US by MADCEL-W.

                                            GeoText                Twitter-US            Twitter-World
                                     Acc@161    Mean  Median Acc@161    Mean  Median Acc@161    Mean  Median
MAD-B                                     50     683     146     xxx     xxx     xxx     xxx     xxx     xxx
MADCEL-B                                  56     609      76      54     709     117      70     936       0
MADCEL-W                                  58     586      60      54     705     116      71     976       0
MADCEL-B-LR                               57     608      65      60     533      77      72     786       0
MADCEL-W-LR                               59     581      57      60     529      78      72     802       0
LR (Rahimi et al., 2015)                  38     880     397      50     686     159      63     866      19
LP (Rahimi et al., 2015)                  45     676     255      37     747     431      56    1026      79
LP-LR (Rahimi et al., 2015)               50     653     151      50     620     157      59     903      53
Wing and Baldridge (2014) (uniform)        -       -       -      49     703     170      32    1714     490
Wing and Baldridge (2014) (k-d)            -       -       -      48     686     191      31    1669     509
Han et al. (2012)                          -       -       -      45     814     260      24    1953     646
Ahmed et al. (2013)                      ???     ???     298       -       -       -       -       -       -
Cha et al. (2015)                        ???     581     425       -       -       -       -       -       -

Table 1: Geolocation results over the three Twitter corpora, comparing
baseline Modified Adsorption (MAD-B), with Modified Adsorption with
celebrity removal (MADCEL-B and MADCEL-W, over binary and weighted
networks, resp.) or celebrity removal plus text priors (MADCEL-B-LR and
MADCEL-W-LR, over binary and weighted networks, resp.); the table also
includes state-of-the-art results for each dataset (“-” signifies that no
results were published for the given dataset; “???” signifies that no
results were reported for the given metric; and “xxx” signifies that
results could not be generated, due to the intractability of the training
data).

A Unified Geolocation Model: To address the issue of disconnected test
users, we incorporate text information into the model by attaching a
labelled dongle node to every test node (Zhu and Ghahramani, 2002;
Goldberg and Zhu, 2006). The label for the dongle node is based on a
text-based $\ell_1$ regularised logistic regression model, using the
method of Rahimi et al. (2015). The dongle nodes with their corresponding
label confidences are added to the seed set, and are treated in the same
way as other labelled nodes (i.e. the training nodes). Once again, we
experiment with text-based labelled dongle nodes over both binary
(“MADCEL-B-LR”) and weighted (“MADCEL-W-LR”) networks.
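
The graph construction steps described in this section (collapsing the @-mention network, dropping celebrities, and attaching dongle nodes) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the code used for the experiments: it only collapses paths through a single mentioned node, it counts each connecting mention as one unit of edge weight, it drops every edge that passes through a handle mentioned by more than T users, and it gives each dongle edge a weight of 1.0. The function names, the input format and the `#dongle` naming are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_collapsed_network(mentions, known_users, celebrity_threshold):
    """Collapse an @-mention network onto the training/test users.

    mentions:    {user: iterable of handles that user mentions} (assumed format)
    known_users: set of training/test users kept as nodes
    Returns {(u, v): weight} with u < v; use the keys alone for a binary graph.
    """
    # Invert the mention lists: handle -> set of known users mentioning it.
    mentioned_by = defaultdict(set)
    for user in known_users:
        for handle in mentions.get(user, ()):
            if handle != user:
                mentioned_by[handle].add(user)

    edges = defaultdict(int)
    for handle, fans in mentioned_by.items():
        if len(fans) > celebrity_threshold:
            continue  # celebrity: assumed to contribute no edges at all
        if handle in known_users:
            # Direct mention of a training/test user.
            for user in fans:
                edges[tuple(sorted((user, handle)))] += 1
        else:
            # Removed node: connect every pair of users that mentioned it.
            for u, v in combinations(sorted(fans), 2):
                edges[(u, v)] += 1
    return dict(edges)


def attach_dongles(edges, seeds, text_predictions, dongle_weight=1.0):
    """Add one labelled dongle node per test user.

    seeds:            {node: (label, confidence)}
    text_predictions: {test_user: (label, confidence)} from a text classifier
    Returns new (edges, seeds) without modifying the inputs.
    """
    edges, seeds = dict(edges), dict(seeds)
    for user, (label, confidence) in text_predictions.items():
        dongle = f"{user}#dongle"  # hypothetical naming scheme
        edges[(user, dongle)] = dongle_weight
        seeds[dongle] = (label, confidence)
    return edges, seeds
```

With four users who all mention the same handle, a threshold of 2 removes all six edges that this handle would otherwise create, which is the mechanism behind the edge-count reductions reported above.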


5 Evaluation 


Following Cheng et al. (2010) and Eisenstein et al. (2010), we evaluate
using the mean and median error (in km) over all test users (“Mean” and
“Median”, resp.), and also accuracy within 161km of the actual location
(“Acc@161”). Note that higher numbers are better for Acc@161, but lower
numbers are better for mean and median error, with a lower bound of 0 and
no (theoretical) upper bound.
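
As a minimal sketch of how these three metrics can be computed, assuming great-circle (haversine) distance on a sphere of radius 6371 km and treating an error of exactly 161 km as a hit (neither detail is specified in the text), with hypothetical function names:

```python
import math
from statistics import mean, median

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def evaluate(predicted, actual):
    """Return Acc@161 (as a percentage), Mean and Median error in km."""
    errors = [haversine_km(p, a) for p, a in zip(predicted, actual)]
    hits = sum(1 for e in errors if e <= 161)
    return {
        "Acc@161": 100.0 * hits / len(errors),
        "Mean": mean(errors),
        "Median": median(errors),
    }
```

A median of 0, as reported for Twitter-World, arises when at least half of the test users are assigned exactly their gold-standard coordinates, which is possible there because locations are city centres.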


To generate a continuous-valued latitude/longitude coordinate for a given
user from the k-d tree cell, we use the median coordinates of all
training points in the predicted region.
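
The discretisation and the mapping back to coordinates can be sketched as below. This is a simplified illustration, not the exact procedure used in the experiments: it assumes a k-d tree that splits at the median of alternating latitude/longitude axes until each cell holds at most a given number of training users (the bucket size, e.g. 50 or 2400 as tuned above), and the function names are hypothetical.

```python
from statistics import median


def kd_partition(points, bucket_size, depth=0):
    """Split (lat, lon) points into k-d tree cells of at most bucket_size points.

    bucket_size must be at least 1. Returns a list of cells (lists of points).
    """
    if len(points) <= bucket_size:
        return [points]
    axis = depth % 2  # alternate between latitude (0) and longitude (1)
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return (kd_partition(ordered[:mid], bucket_size, depth + 1)
            + kd_partition(ordered[mid:], bucket_size, depth + 1))


def cell_centre(cell):
    """Median latitude and longitude of the training points in a cell."""
    return (median(p[0] for p in cell), median(p[1] for p in cell))
```

Each cell index then serves as a class label during propagation, and `cell_centre` of the predicted cell gives the coordinate that is scored against the gold-standard location.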













































6 Results 


Table 1 shows the performance of MAD-B, MADCEL-B, MADCEL-W, MADCEL-B-LR
and MADCEL-W-LR over the GeoText, Twitter-US and Twitter-World datasets.
The results are also compared with prior work on network-based
geolocation using label propagation (LP) (Rahimi et al., 2015),
text-based classification models (Han et al., 2012; Wing and Baldridge,
2011; Wing and Baldridge, 2014; Rahimi et al., 2015; Cha et al., 2015),
text-based graphical models (Ahmed et al., 2013), and network-text hybrid
models (LP-LR) (Rahimi et al., 2015).

Our baseline network-based model of MAD-B outperforms the text-based
models and also previous network-based models (Jurgens, 2013; Compton et
al., 2014; Rahimi et al., 2015). The inference, however, is intractable
for Twitter-US and Twitter-World due to the size of the network.

Celebrity removal in MADCEL-B and MADCEL-W has a positive effect on
geolocation accuracy, and results in a 47% reduction in Median over
GeoText. It also makes graph inference over Twitter-US and Twitter-World
tractable, and results in superior Acc@161 and Median, but slightly
inferior Mean, compared to the state-of-the-art results of LR, based on
text-based classification (Rahimi et al., 2015).

MADCEL-W (weighted graph) outperforms MADCEL-B (binary graph) over the
smaller GeoText dataset where it compensates for the sparsity of network
information, but doesn’t improve the results for the two larger datasets
where network information is denser.

Adding text to the network-based geolocation models in the form of
MADCEL-B-LR (binary edges) and MADCEL-W-LR (weighted edges), we achieve
state-of-the-art results over all three datasets. The inclusion of
text-based priors has the greatest impact on Mean, resulting in an
additional 26% and 23% error reduction over Twitter-US and Twitter-World,
respectively. The reason for this is that it provides a user-specific
geolocation prior for (relatively) disconnected users.


more tractable); and (b) incorporating text-based geolocation priors into
the model.

As future work, we plan to use temporal data and also look at improving
the text-based geolocation model using sparse coding (Cha et al., 2015).
We also plan to investigate more nuanced methods for differentiating
between global and local celebrity nodes, to be able to filter out global
celebrity nodes but preserve local nodes that can have high geolocation
utility.